The present paper complements the hardware proposal for a fully distributed parallel 
processor presented in "A Large Scale, Homogeneous, Fully Distributed Parallel Machine, I," 
by describing a suitable software structure for its operating system management. It is 
shown that the most basic operating system functions can be performed on a purely local 
basis, i.e. in such a fashion that identical pieces of the operating system are concurrently 
executed by each of the processing el ... 

Previous research has shown that the SPEC benchmarks achieve low miss ratios in 
relatively small instruction caches. This paper presents evidence that current software- 
development practices produce applications that exhibit substantially higher instruction- 
cache miss ratios than do the SPEC benchmarks. To represent these trends, we have 
assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that 
provides a better test of instruction-cache performance. We discuss th ... 

December 2004 Proceedings of the 37th annual International Symposium on 

This paper analyzes memory access scheduling and virtual channels as mechanisms to 
reduce the latency of main memory accesses by the CPU and peripherals in web servers. 
Despite the address filtering effects of the CPU's cache hierarchy, there is significant locality 
and bank parallelism in the DRAM access stream of a web server, which includes traffic 
from the operating system, application, and peripherals. However, a sequential memory 
controller leaves much of this locality and parallelism unex ... 

Virtual memory page translation tables provide mappings from virtual to physical 
addresses. When the hardware controlled Translation Lookaside Buffers (TLBs) do not 
contain a translation, these tables provide the translation. Approaches to the structure and 
management of these tables vary from full hardware implementations to complete software 
based algorithms. The size of the virtual address space used by processes is rapidly growing 
beyond 32 bits of address. As the utilized ad ... 

The current trend of computer system technology is toward CPUs with rapidly increasing 
processing power and toward disk drives of rapidly increasing density, but with disk 
performance increasing very slowly if at all. The implication of these trends is that at some 
point the processing power of computer systems will be limited by the throughput of the 
input/output (I/O) system. A solution to this problem, which is described and evaluated in 
this paper, is disk cache 

Exploitation of large amounts of instruction level parallelism requires a large amount of 
connectivity between the shared register file and the function units; this connectivity is 
expensive and increases the cycle time. This paper shows that the new class of transport 
triggered architectures requires fewer ports on the shared register file than traditional 
operation triggered architectures. This is achieved by programming data-transports instead 
of operations. Experimen ... 

Jeffrey Scott Vitter 

June 2001 ACM Computing Surveys (CSUR), volume 33 issue 2 

Data sets in large applications are often too massive to fit completely inside the computer's 
internal memory. The resulting input/output communication (or I/O) between fast internal 
memory and slower external memory (such as disks) can be a major performance 
bottleneck. In this article we survey the state of the art in the design and analysis of 
external memory (or EM) algorithms and data structures, where the goal is to exploit 
locality in order to reduce the I/O costs. We consider a varie ... 

Keywords: B-tree, I/O, batched, block, disk, dynamic, extendible hashing, external 
memory, hierarchical memory, multidimensional access methods, multilevel memory, 

Compiling for distributed-memory machines has been a very active research area in recent 
years. Much of this work has concentrated on programs that use arrays as their primary 
data structures. To date, little work has been done to address the problem of supporting 
programs that use pointer-based dynamic data structures. The techniques developed for 
supporting SPMD execution of array-based programs rely on the fact that arrays are 
statically defined and directly addressable. Recursive data s ... 

Keywords: dynamic data structures 

Dynamic memory allocators (malloc/free) rely on mutual exclusion locks for protecting the 
consistency of their shared data structures under multithreading. The use of locking has 
many disadvantages with respect to performance, availability, robustness, and 
programming flexibility. A lock-free memory allocator guarantees progress regardless of 
whether some threads are delayed or even killed and regardless of scheduling policies. This 
paper presents a completely lock-free memory allocator. It uses ... 

Keywords: async-signal-safe, availability, lock-free, malloc 

A very long instruction word (VLIW) processor exploits parallelism by controlling multiple 
operations in a single instruction word. This paper describes the architecture and compiler 
tradeoffs in the design of iWarp, a VLIW single-chip microprocessor developed in a joint 
project with Intel Corp. The iWarp processor is capable of specifying up to nine operations 
in an instruction word and has a peak performance of 20 million floating-point operations 
and 20 million integer operations per sec ... 

To exploit instruction level parallelism, compilers for VLIW and superscalar processors often 
employ static code scheduling. However, the available code reordering may be severely 
restricted due to ambiguous dependences between memory instructions. This paper 
introduces a simple hardware mechanism, referred to as the memory conflict buffer, which 
facilitates static code scheduling in the presence of memory store/load dependences. 
Correct program execution is ensured by the ... 

Many studies have investigated performance improvement through exploiting instruction- 
level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate 
performance improvement using the number of cycles that are required to execute a 
program, but do not quantitatively estimate the penalty imposed on the cycle time from the 
architecture. Since the performance of a microprocessor must be measured by its execution 
time, a cycle time evaluation is required as well as a cy ... 

19 Special session on memory wall: Fighting the memory wall with assisted execution 
Michel Dubois 

April 2004 Proceedings of the 1st conference on Computing frontiers 

Assisted execution is a form of simultaneous multithreading in which a set of auxiliary 
"assistant" threads, called nanothreads, is attached to each thread of an application. 
Nanothreads are lightweight threads which run on the same processor as the main 
(application) thread and help execute the main thread as fast as possible. Nanothreads 
exploit resources that are idled in the processor because of hazards due to program 
dependencies and memory access delays. Assisted execution has the po ... 

Keywords: cache memories, latency tolerance, prefetching, simultaneous multithreading, 
superscalar processors 

Dr. Jeffrey S. Steinman, Jennifer W. Wong 

June 2003 Proceedings of the seventeenth workshop on Parallel and distributed 
simulation 

This paper provides an overview of the SPEEDES persistence framework that is currently 
used to automate checkpoint/restart for the Joint Simulation System. The persistence 
framework interfaces are documented in this paper and are proposed standards for the 
Standard Simulation Architecture. The persistence framework fundamentally keeps track of 
memory allocations and pointer references within a high-speed internal database linked with 
applications. With persistence, an object, and the collection of object ... 